2023 iThome 鐵人賽

Using SeamlessM4T to Learn Speech Recognition Architectures and Applications — Series, Part 24

DAY24 - How to Train a Transformer Model

Transformer models are built around attention and self-attention, techniques that continue to evolve and that detect how the elements of a sequence influence and depend on one another, even across long distances. Transformers have largely displaced the deep learning models that dominated just a few years ago, convolutional and recurrent neural networks (CNNs and RNNs). Structurally, the model has two halves: the encoder (the left side of the standard architecture diagram) and the decoder (the right side).
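
At the heart of every block below is scaled dot-product attention, softmax(QK^T / sqrt(d_k))·V. As a warm-up, here is a minimal sketch of that operation on dummy tensors; the shapes (32 x 8 x 10 x 64) simply mirror the shape comments in the code further down, and the function name is only illustrative.

    import math
    import torch
    import torch.nn.functional as F

    def scaled_dot_product_attention(q, k, v, mask=None):
        # q, k, v: (batch, n_heads, seq_len, head_dim)
        scores = torch.matmul(q, k.transpose(-1, -2)) / math.sqrt(q.size(-1))
        if mask is not None:
            # block masked positions before the softmax
            scores = scores.masked_fill(mask == 0, float("-1e20"))
        weights = F.softmax(scores, dim=-1)   # attention weights sum to 1 over the keys
        return torch.matmul(weights, v)       # weighted sum of the values

    q = k = v = torch.rand(32, 8, 10, 64)     # batch=32, heads=8, seq_len=10, head_dim=64
    print(scaled_dot_product_attention(q, k, v).shape)   # torch.Size([32, 8, 10, 64])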

Basic Components

The encoder and decoder are assembled from three building blocks: word embeddings, positional encoding, and multi-head self-attention. A short usage sketch of the three modules follows this list.

  • Building word embeddings

    import math
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Embedding(nn.Module):
        def __init__(self, vocab_size, embed_dim):
            """
            Args:
                vocab_size: size of the vocabulary
                embed_dim: dimension of the embeddings
            """
            super(Embedding, self).__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)

        def forward(self, x):
            """
            Args:
                x: input vector of token ids
            Returns:
                out: embedding vectors
            """
            out = self.embed(x)
            return out
    
  • Positional encoding

    # register_buffer in PyTorch:
    # if your model has tensors that should be saved and restored in the state_dict
    # but not trained by the optimizer, register them as buffers.

    class PositionalEmbedding(nn.Module):
        def __init__(self, max_seq_len, embed_model_dim):
            """
            Args:
                max_seq_len: maximum length of the input sequence
                embed_model_dim: dimension of the embedding
            """
            super(PositionalEmbedding, self).__init__()
            self.embed_dim = embed_model_dim

            # sinusoidal positional encoding from "Attention Is All You Need":
            # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
            # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
            # note that the loop variable i already runs over the even indices (the "2i")
            pe = torch.zeros(max_seq_len, self.embed_dim)
            for pos in range(max_seq_len):
                for i in range(0, self.embed_dim, 2):
                    pe[pos, i] = math.sin(pos / (10000 ** (i / self.embed_dim)))
                    pe[pos, i + 1] = math.cos(pos / (10000 ** (i / self.embed_dim)))
            pe = pe.unsqueeze(0)
            self.register_buffer('pe', pe)

        def forward(self, x):
            """
            Args:
                x: input embeddings
            Returns:
                x: embeddings with positional information added
            """
            # make the embeddings relatively larger
            x = x * math.sqrt(self.embed_dim)
            # add the positional encoding (a non-trainable buffer)
            seq_len = x.size(1)
            x = x + self.pe[:, :seq_len]
            return x
    
  • Self-attention (multi-head attention)

    class MultiHeadAttention(nn.Module):
        def __init__(self, embed_dim=512, n_heads=8):
            """
            Args:
                embed_dim: dimension of the embedding vector
                n_heads: number of self-attention heads
            """
            super(MultiHeadAttention, self).__init__()

            self.embed_dim = embed_dim    # 512
            self.n_heads = n_heads        # 8
            self.single_head_dim = int(self.embed_dim / self.n_heads)   # 512/8 = 64; each key, query and value is 64-dimensional

            # key, query and value projection matrices (64 x 64), shared across all heads
            self.query_matrix = nn.Linear(self.single_head_dim, self.single_head_dim, bias=False)
            self.key_matrix = nn.Linear(self.single_head_dim, self.single_head_dim, bias=False)
            self.value_matrix = nn.Linear(self.single_head_dim, self.single_head_dim, bias=False)
            self.out = nn.Linear(self.n_heads * self.single_head_dim, self.embed_dim)

        def forward(self, key, query, value, mask=None):    # batch_size x sequence_length x embedding_dim    # 32 x 10 x 512
            """
            Args:
               key: key vector
               query: query vector
               value: value vector
               mask: mask for the decoder

            Returns:
               output vector from multi-head attention
            """
            batch_size = key.size(0)
            seq_length = key.size(1)

            # the query length can differ from the key length in the decoder during inference,
            # so we cannot reuse the general seq_length
            seq_length_query = query.size(1)

            # 32x10x512 -> 32x10x8x64
            key = key.view(batch_size, seq_length, self.n_heads, self.single_head_dim)            # batch_size x sequence_length x n_heads x single_head_dim = (32x10x8x64)
            query = query.view(batch_size, seq_length_query, self.n_heads, self.single_head_dim)  # (32x10x8x64)
            value = value.view(batch_size, seq_length, self.n_heads, self.single_head_dim)        # (32x10x8x64)

            k = self.key_matrix(key)       # (32x10x8x64)
            q = self.query_matrix(query)
            v = self.value_matrix(value)

            q = q.transpose(1, 2)  # (batch_size, n_heads, seq_len, single_head_dim)    # (32 x 8 x 10 x 64)
            k = k.transpose(1, 2)  # (batch_size, n_heads, seq_len, single_head_dim)
            v = v.transpose(1, 2)  # (batch_size, n_heads, seq_len, single_head_dim)

            # compute the attention scores
            # transpose the key for the matrix multiplication
            k_adjusted = k.transpose(-1, -2)       # (batch_size, n_heads, single_head_dim, seq_len)  # (32 x 8 x 64 x 10)
            product = torch.matmul(q, k_adjusted)  # (32 x 8 x 10 x 64) x (32 x 8 x 64 x 10) = (32x8x10x10)

            # fill positions of the score matrix with -1e20 where the mask is 0
            if mask is not None:
                 product = product.masked_fill(mask == 0, float("-1e20"))

            # divide by the square root of the key dimension
            product = product / math.sqrt(self.single_head_dim)  # / sqrt(64)

            # apply softmax
            scores = F.softmax(product, dim=-1)

            # multiply with the value matrix
            scores = torch.matmul(scores, v)  # (32x8x10x10) x (32x8x10x64) = (32x8x10x64)

            # concatenate the heads
            concat = scores.transpose(1, 2).contiguous().view(batch_size, seq_length_query, self.single_head_dim * self.n_heads)  # (32x8x10x64) -> (32x10x8x64) -> (32x10x512)

            output = self.out(concat)  # (32,10,512) -> (32,10,512)

            return output
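
Here is the usage sketch mentioned above, wiring the three building blocks together on a dummy batch. It assumes the three classes and the imports defined in the list above; the batch size 32, sequence length 10 and embedding dimension 512 mirror the shape comments in the code, while the vocabulary size 1000 is an arbitrary example value.

    import torch

    vocab_size, seq_len, embed_dim = 1000, 10, 512
    tokens = torch.randint(0, vocab_size, (32, seq_len))      # a batch of token ids

    emb = Embedding(vocab_size, embed_dim)(tokens)            # (32, 10, 512)
    x = PositionalEmbedding(seq_len, embed_dim)(emb)          # (32, 10, 512), with positions added
    out = MultiHeadAttention(embed_dim, n_heads=8)(x, x, x)   # self-attention: key = query = value
    print(out.shape)                                          # torch.Size([32, 10, 512])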
    

Encoder

class TransformerBlock(nn.Module):
    def __init__(self, embed_dim, expansion_factor=4, n_heads=8):
        super(TransformerBlock, self).__init__()
        
        """
        Args:
           embed_dim: dimension of the embedding
           expansion_factor: factor which determines the hidden dimension of the feed-forward layer
           n_heads: number of attention heads
        
        """
        self.attention = MultiHeadAttention(embed_dim, n_heads)
        
        self.norm1 = nn.LayerNorm(embed_dim) 
        self.norm2 = nn.LayerNorm(embed_dim)
        
        self.feed_forward = nn.Sequential(
                          nn.Linear(embed_dim, expansion_factor*embed_dim),
                          nn.ReLU(),
                          nn.Linear(expansion_factor*embed_dim, embed_dim)
        )

        self.dropout1 = nn.Dropout(0.2)
        self.dropout2 = nn.Dropout(0.2)

    def forward(self,key,query,value):
        
        """
        Args:
           key: key vector
           query: query vector
           value: value vector
           norm2_out: output of transformer block
        
        """
        
        attention_out = self.attention(key,query,value)  #32x10x512
        attention_residual_out = attention_out + value  #32x10x512
        norm1_out = self.dropout1(self.norm1(attention_residual_out)) #32x10x512

        feed_fwd_out = self.feed_forward(norm1_out) #32x10x512 -> #32x10x2048 -> 32x10x512
        feed_fwd_residual_out = feed_fwd_out + norm1_out #32x10x512
        norm2_out = self.dropout2(self.norm2(feed_fwd_residual_out)) #32x10x512

        return norm2_out

class TransformerEncoder(nn.Module):
    """
    Args:
        seq_len : length of input sequence
        embed_dim: dimension of embedding
        num_layers: number of encoder layers
        expansion_factor: factor which determines the hidden dimension of the feed-forward layer
        n_heads: number of heads in multihead attention
        
    Returns:
        out: output of the encoder
    """
    def __init__(self, seq_len, vocab_size, embed_dim, num_layers=2, expansion_factor=4, n_heads=8):
        super(TransformerEncoder, self).__init__()
        
        self.embedding_layer = Embedding(vocab_size, embed_dim)
        self.positional_encoder = PositionalEmbedding(seq_len, embed_dim)

        self.layers = nn.ModuleList([TransformerBlock(embed_dim, expansion_factor, n_heads) for i in range(num_layers)])

    def forward(self, x):
        embed_out = self.embedding_layer(x)
        out = self.positional_encoder(embed_out)
        for layer in self.layers:
            out = layer(out,out,out)

        return out  #32x10x512
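
A quick sanity check of the encoder on a dummy batch; this is a sketch, and the vocabulary size and shapes below are arbitrary example values rather than anything prescribed by the post.

    import torch

    src_vocab_size, seq_len, embed_dim = 1000, 10, 512
    encoder = TransformerEncoder(seq_len, src_vocab_size, embed_dim,
                                 num_layers=2, expansion_factor=4, n_heads=8)

    src = torch.randint(0, src_vocab_size, (32, seq_len))  # (batch, seq_len) of token ids
    enc_out = encoder(src)
    print(enc_out.shape)                                   # torch.Size([32, 10, 512])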

Decoder

class DecoderBlock(nn.Module):
    def __init__(self, embed_dim, expansion_factor=4, n_heads=8):
        super(DecoderBlock, self).__init__()

        """
        Args:
           embed_dim: dimension of the embedding
           expansion_factor: factor which determines the hidden dimension of the feed-forward layer
           n_heads: number of attention heads
        
        """
        self.attention = MultiHeadAttention(embed_dim, n_heads=n_heads)
        self.norm = nn.LayerNorm(embed_dim)
        self.dropout = nn.Dropout(0.2)
        self.transformer_block = TransformerBlock(embed_dim, expansion_factor, n_heads)

    def forward(self, key, query, x, mask):

        """
        Args:
           key: key vector
           query: query vector
           x: input on which the masked self-attention is computed
           mask: mask for the masked multi-head attention
        Returns:
           out: output of the transformer block

        """

        # the mask is only passed to the first (masked) attention
        attention = self.attention(x, x, x, mask=mask)  # 32x10x512
        value = self.dropout(self.norm(attention + x))
        
        out = self.transformer_block(key, query, value)

        
        return out

class TransformerDecoder(nn.Module):
    def __init__(self, target_vocab_size, embed_dim, seq_len, num_layers=2, expansion_factor=4, n_heads=8):
        super(TransformerDecoder, self).__init__()
        """  
        Args:
           target_vocab_size: vocabulary size of taget
           embed_dim: dimension of embedding
           seq_len : length of input sequence
           num_layers: number of encoder layers
           expansion_factor: factor which determines number of linear layers in feed forward layer
           n_heads: number of heads in multihead attention
        
        """
        self.word_embedding = nn.Embedding(target_vocab_size, embed_dim)
        self.position_embedding = PositionalEmbedding(seq_len, embed_dim)

        self.layers = nn.ModuleList(
            [
                DecoderBlock(embed_dim, expansion_factor=expansion_factor, n_heads=n_heads)
                for _ in range(num_layers)
            ]

        )
        self.fc_out = nn.Linear(embed_dim, target_vocab_size)
        self.dropout = nn.Dropout(0.2)

    def forward(self, x, enc_out, mask):
        
        """
        Args:
            x: input vector from target
            enc_out : output from encoder layer
            trg_mask: mask for decoder self attention
        Returns:
            out: output vector
        """
            
        
        x = self.word_embedding(x)  #32x10x512
        x = self.position_embedding(x) #32x10x512
        x = self.dropout(x)

				for layer in self.layers:
            x = layer(enc_out, x, enc_out, mask) 

        out = F.softmax(self.fc_out(x))

        return out

class Transformer(nn.Module):
    def __init__(self, embed_dim, src_vocab_size, target_vocab_size, seq_length, num_layers=2, expansion_factor=4, n_heads=8):
        super(Transformer, self).__init__()
        
        """  
        Args:
           embed_dim:  dimension of embedding 
           src_vocab_size: vocabulary size of source
           target_vocab_size: vocabulary size of target
           seq_length : length of input sequence
           num_layers: number of encoder layers
           expansion_factor: factor which determines the hidden dimension of the feed-forward layer
           n_heads: number of heads in multihead attention
        
        """
        
        self.target_vocab_size = target_vocab_size

        self.encoder = TransformerEncoder(seq_length, src_vocab_size, embed_dim, num_layers=num_layers, expansion_factor=expansion_factor, n_heads=n_heads)
        self.decoder = TransformerDecoder(target_vocab_size, embed_dim, seq_length, num_layers=num_layers, expansion_factor=expansion_factor, n_heads=n_heads)

    def make_trg_mask(self, trg):
        """
        Args:
            trg: target sequence
        Returns:
            trg_mask: target mask
        """
        batch_size, trg_len = trg.shape
        # returns the lower triangular part of matrix filled with ones
        trg_mask = torch.tril(torch.ones((trg_len, trg_len))).expand(
            batch_size, 1, trg_len, trg_len
        )
        return trg_mask

    def decode(self, src, trg):
        """
        For inference.
        Args:
            src: input to the encoder
            trg: input to the decoder
        Returns:
            out_labels: the final predicted sequence of token ids
        """
        trg_mask = self.make_trg_mask(trg)
        enc_out = self.encoder(src)
        out_labels = []
        batch_size,seq_len = src.shape[0],src.shape[1]
        #outputs = torch.zeros(seq_len, batch_size, self.target_vocab_size)
        out = trg
        for i in range(seq_len):  # 10
            out = self.decoder(out, enc_out, trg_mask)  # bs x seq_len x vocab_dim
            # take the prediction for the last token
            out = out[:, -1, :]

            # greedy decoding; out.item() assumes a batch size of 1
            out = out.argmax(-1)
            out_labels.append(out.item())
            out = torch.unsqueeze(out, dim=0)
          
        
        return out_labels
    
    def forward(self, src, trg):
        """
        Args:
            src: input to encoder 
            trg: input to decoder
        Returns:
            out: final vector with the probability of each target word
        """
        trg_mask = self.make_trg_mask(trg)
        enc_out = self.encoder(src)
   
        outputs = self.decoder(trg, enc_out, trg_mask)
        return outputs
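
Putting the pieces together, here is a minimal end-to-end sketch of a single forward pass. The vocabulary sizes and random token ids are arbitrary example values; in real training the per-token probabilities returned by the decoder would be compared against the (shifted) target sequence with a suitable loss.

    import torch

    src_vocab_size = target_vocab_size = 1000
    seq_length, embed_dim = 10, 512

    model = Transformer(embed_dim, src_vocab_size, target_vocab_size, seq_length,
                        num_layers=2, expansion_factor=4, n_heads=8)

    src = torch.randint(0, src_vocab_size, (32, seq_length))     # source token ids
    trg = torch.randint(0, target_vocab_size, (32, seq_length))  # target token ids

    out = model(src, trg)
    print(out.shape)   # torch.Size([32, 10, 1000]): per-token probabilities over the target vocabulary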

Summary

The Transformer is a deep learning model that has matured over several years and has largely replaced CNNs and RNNs for sequence tasks. Looking at its structure, the front half is the encoder and the back half is the decoder, and its self-attention mechanism makes training far more efficient. The large amounts of labeled data once required to train CNNs and RNNs can be greatly reduced when training Transformers (largely thanks to self-supervised pre-training), which has driven rapid progress in machine translation.

